Exercise - Validation Metrics for Classification

  1. Load the data (train and test data)
  2. Fit the Logistic Regression
  3. (ASSIGNMENT) Check the accuracy and the AU ROC
  4. Visualize the ROC curve
  5. Discuss metric results

NOTE: Run all cells up to TASK 1 without changing them.

By: Hugo Lopes
Learning Unit 11


In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve, confusion_matrix)
from sklearn.model_selection import train_test_split
%matplotlib inline

In [ ]:
def plot_roc_curve(roc_auc, fpr, tpr):
    """Plot a ROC curve together with the random-classifier diagonal.

    Inputs:
        roc_auc - AU ROC value (float)
        fpr - false positive rates (array, output of roc_curve())
        tpr - true positive rates (array, output of roc_curve())
    """
    
    plt.figure(figsize=(8,6))
    lw = 2
    plt.plot(fpr, tpr, color='orange', lw=lw, label='ROC curve (AUROC = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--', label='random')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.grid()
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

Load an example dataset

The data is already prepared for a classifier.


In [ ]:
df = pd.read_csv('../data/exercise_dataset_LU11.csv')
print('Shape:', df.shape)
df.head()

Divide into Train and Test sets (the first column of df is the target; the remaining columns are the features):

  • X_train: train data
  • y_train: target of train data
  • X_test: test data
  • y_test: target of test data

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], 
                                                    df.iloc[:, 0], 
                                                    test_size=0.33, 
                                                    random_state=42)

Task 1: Fit the LogisticRegression() with the Train Set


In [ ]:
# Code here:
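
As a reminder of the general pattern, here is a minimal sketch of fitting a logistic regression. It uses hypothetical toy data from make_classification, not the exercise dataset, and the names X_toy, y_toy and toy_clf are made up for illustration. For the task itself, the same call is applied to X_train and y_train.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data (assumption: NOT the exercise dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Create the estimator with default settings and fit it on features + target
toy_clf = LogisticRegression()
toy_clf.fit(X_toy, y_toy)

# After fitting, the model exposes the class labels it saw
# and one coefficient per feature
print(toy_clf.classes_)
print(toy_clf.coef_.shape)
```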

Task 2: Get the predictions and the scores (probabilities) on the Test Set


In [ ]:
# Code here:
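
The difference between predictions and scores can be sketched as follows. This again assumes hypothetical toy data (X_toy, y_toy, toy_clf are illustrative names, not part of the exercise): predict() returns hard class labels, while predict_proba() returns one probability column per class, and the column for the positive class is typically what is used as the score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data and model (assumption: NOT the exercise dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
toy_clf = LogisticRegression().fit(X_toy, y_toy)

# Hard class labels (0 or 1)
toy_pred = toy_clf.predict(X_toy)

# Class probabilities: one row per sample, one column per class
toy_proba = toy_clf.predict_proba(X_toy)

# Score for the positive class (second column, matching classes_[1])
toy_scores = toy_proba[:, 1]

print(toy_pred[:5])
print(toy_scores[:5])
```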

Task 3: Get the accuracy score, the AU ROC, and the ROC curve


In [ ]:
# Code here for accuracy score, AU ROC:

In [ ]:
# Code here for ROC curve:

# Call plot_roc_curve():
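
To illustrate which inputs each metric expects, here is a small sketch on four hand-made samples (hypothetical values, not taken from the exercise dataset). accuracy_score compares hard labels, whereas roc_auc_score and roc_curve work on continuous scores. The 0.5 threshold used to derive the labels is an assumption of this sketch.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

# Hypothetical true labels and positive-class scores
y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

# Hard labels from thresholding the scores at 0.5 (assumed threshold)
y_pred_toy = (y_score_toy >= 0.5).astype(int)

# Accuracy uses the hard labels: 3 of 4 correct
toy_acc = accuracy_score(y_true_toy, y_pred_toy)

# AU ROC uses the scores: 3 of the 4 (positive, negative) pairs are ranked correctly
toy_auc = roc_auc_score(y_true_toy, y_score_toy)

# roc_curve returns the points of the curve, one per threshold
toy_fpr, toy_tpr, toy_thresholds = roc_curve(y_true_toy, y_score_toy)

print(toy_acc, toy_auc)
```

The three values toy_auc, toy_fpr and toy_tpr have the shapes that plot_roc_curve() defined earlier in this notebook takes as arguments.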

Task 4: Discuss the results

What do you think about the AU ROC? And what about the accuracy score: do you think it is high?
Hint: take a look at the class balance.
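
Why class balance matters can be sketched with hypothetical labels (95 negatives and 5 positives, chosen only for illustration and saying nothing about the exercise dataset). A "model" that always predicts the majority class reaches a high accuracy, while its constant scores give an AU ROC no better than random.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced target: 95 negatives, 5 positives
y_true_imb = np.array([0] * 95 + [1] * 5)

# A trivial baseline that always predicts the majority class
y_pred_majority = np.zeros(100, dtype=int)
y_score_constant = np.zeros(100)

# Share of each class in the target
class_share = np.bincount(y_true_imb) / len(y_true_imb)

# Accuracy equals the share of the majority class
acc_majority = accuracy_score(y_true_imb, y_pred_majority)

# Constant scores cannot rank positives above negatives
auc_constant = roc_auc_score(y_true_imb, y_score_constant)

print(class_share, acc_majority, auc_constant)
```

On a real dataset, a comparable check could be done by comparing the accuracy against the share of the most frequent class in the target.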